Northeastern University in TREC 2009 Million Query Track
Authors
Abstract
Ranking is a central problem in information retrieval. Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and the underlying documents in order to assess the relative relevance of a document to a given query and thus rank the underlying collection. The sheer size of this problem has led to the development of learning-to-rank (LTR) algorithms that can automate the construction of such ranking functions: given a training set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and document features so as to effectively assess the relevance of any document to any query and thus rank a collection in response to user input.

Much thought and research has been devoted to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the construction of appropriate learning-to-rank data sets, or on the effect of these data sets on the ability of a learning-to-rank algorithm to "learn" effectively. Given that IR technology is ubiquitous in a vast variety of contexts and environments, it is not unreasonable to assume that searchable material (corpora) and user information needs will vary radically from one retrieval environment to another. In principle, ranking functions should be trained over collections with characteristics similar to those of the collections in which they will be deployed. However, the ability to construct different ranking functions for different retrieval environments is limited by the cost of constructing such customized training collections. Thus, the question that naturally arises is whether training on a collection with certain characteristics can still lead to an effective ranking function over collections with different characteristics. To answer this question, we trained our ranking functions (by employing an SVM) over two different collections, (a) the Million Query 2008 (MQ08) collection (the GOV2 corpus and queries with at least one click on documents in the .gov domain), and (b) a Bing-generated collection (described in Section 2.1), and applied the constructed ranking functions to the Million Query 2009 (MQ09) collection (the ClueWeb09 corpus and general web queries).

Furthermore, even within a given retrieval environment (represented by a given collection), different queries may have radically different characteristics, and thus different features may better capture the notion of relevance. For instance, in the case of precision-oriented queries, such as homepage/namepage finding, the URL of a document or its popularity may be more indicative of the document's relevance than the document text itself, while for informational queries the document URL may be less indicative than its text. Most existing learning-to-rank approaches train a single ranking function to handle all queries. Hence, the question that arises is whether training a different ranking function for each of these query types can further improve retrieval effectiveness.
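The abstract sketches the standard pairwise view of SVM-based learning to rank: pairs of documents judged for the same query, with different relevance labels, become binary classification examples over feature-difference vectors, and the learned weight vector then scores unseen documents. The following is a minimal illustrative sketch of that reduction in Python; the helper names, the scikit-learn LinearSVC solver, and the plain linear scorer are assumptions of this sketch, not the exact pipeline behind the runs described here.

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, qid):
    # For every pair of documents judged for the same query with different
    # relevance labels, emit their feature-difference vector and a +1/-1
    # label encoding which document should rank higher.
    diffs, prefs = [], []
    for i, j in combinations(range(len(y)), 2):
        if qid[i] != qid[j] or y[i] == y[j]:
            continue
        diffs.append(X[i] - X[j])
        prefs.append(np.sign(y[i] - y[j]))
    return np.asarray(diffs), np.asarray(prefs)

def train_ranker(X, y, qid):
    # Fit a linear SVM on the pairwise preferences (a RankSVM-style
    # reduction) and return the learned feature-weight vector w.
    diffs, prefs = pairwise_transform(X, y, qid)
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(diffs, prefs)
    return svm.coef_.ravel()

def rank(w, X_candidates):
    # Score candidate documents by the linear model and return their
    # indices, best first.
    return np.argsort(-(X_candidates @ w))

A weight vector trained this way on one collection's (feature vector, relevance) pairs can be applied unchanged to candidate documents from another collection, which is the cross-collection transfer the abstract asks about; a per-query-type variant would keep one weight vector per query category and choose among them with a query classifier.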
Similar resources
Axiomatic Approaches to Information Retrieval--University of Delaware at TREC 2009 Million Query and Web Tracks
We report our experiments in TREC 2009 Million Query track and Adhoc task of Web track. Our goal is to evaluate the effectiveness of axiomatic retrieval models on the large data collection. Axiomatic approaches to information retrieval have been recently proposed and studied. The basic idea is to search for retrieval functions that can satisfy all the reasonable retrieval constraints. Previous ...
University of Amsterdam and University of Twente at the TREC 2007 Million Query Track
In this paper, we document our submissions to the TREC 2007 Million Query track. Our main aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings ove...
IIIT Hyderabad at Million Query Track TREC 2009
This was our maiden attempt at the Million Query track, TREC 2009. We submitted three runs for ad-hoc retrieval task in Million Query track. We explored ad-hoc retrieval of web pages using Hadoop—a distributed infrastructure. To enhance recall, we expanded the queries using WordNet and also by combining the query with all possible subsets of tokens present in the query. To prevent query drift we ex...
Million Query Track 2009 Overview
The Million Query (1MQ) track ran for the second time in TREC 2008. The track is designed to serve two purposes: first, it is an exploration of ad-hoc retrieval over a large set of queries and a large collection of documents; second, it investigates questions of system evaluation, in particular whether it is better to evaluate using many shallow judgments or fewer thorough judgments. Participan...
TREC 2009 at the University of Buffalo: Interactive Legal E-Discovery With Enron Emails
For the TREC 2009, the team from University at Buffalo, the State University of New York participated in the Legal E-Discovery track, working on the interactive search task. We explored indexing and searching at both the record level and the document level with the Enron email collection. We studied the usefulness of fielded search and document presentation features such as clustering documents...
Collection Selection Based on Historical Performance for Efficient Processing
A Grid Information Retrieval (GIR) simulation was used to process the TREC Million Query Track queries. The GOV2 collection was partitioned by hostname and the aggregate performance of each host, as measured by qrel counts from the past TREC Terabyte Tracks, was used to rank the hosts in order of quality. Only the 100 highest quality hosts were included in the Grid IR simulation, representing l...
Publication date: 2009